Summary

A GO enrichment analysis identifies GO terms that are annotated to a subset of genes more often than expected by chance. The expected frequency of each GO term is defined by the gene universe, which acts as the background model.

The query gene list was obtained by integrating B-ALL-associated DMPs that fall within enhancer regions, as defined by the Ensembl Regulatory Build, with PCHi-C data (for now we are using naive B cells).
These genes are therefore the ones possibly being dysregulated by long-range regulatory elements (enhancers).


To start, the subset of genes has to be read in.

The subset of genes we want to study consists of 1607 genes.

29139 genes will be used as background.



The GO annotation database has to be retrieved. The ViSEAGO package has the option of using previous releases (GRCh37.p13). We are going to use the Ensembl database, as our genes are identified by Ensembl IDs.

A short explanation of why this package is better than clusterProfiler (for example) is summarised in the table found at the following link: link

## Cache found

Functional Enrichment Analysis (BP)

The enrichment analysis is performed with the classic algorithm, which tests each GO term independently, and Fisher's exact test, which is based on gene counts. The test is run against the background defined above.
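To make the gene-count test concrete, here is a minimal sketch of a one-sided Fisher test for a single GO term, written as the upper tail of the hypergeometric distribution. It is an illustration in plain Python, not the code used to produce this report. The query size (1607) and universe size (29139) come from the report; the function names and the term counts (200 annotated genes in the universe, 25 of them in the query) are made up for the example.

```python
from math import comb


def expected_count(n_query, n_term, n_universe):
    """Query genes expected to carry the term if genes were drawn at random."""
    return n_query * n_term / n_universe


def fisher_enrichment_pvalue(k, n_query, n_term, n_universe):
    """One-sided Fisher p-value: P(X >= k) under the hypergeometric null.

    k          -- query genes annotated to the term
    n_query    -- size of the query gene list
    n_term     -- universe genes annotated to the term
    n_universe -- size of the universe (background)
    """
    upper = min(n_query, n_term)
    total = comb(n_universe, n_query)
    # Sum the probabilities of seeing k, k+1, ..., upper annotated query genes.
    tail = sum(
        comb(n_term, i) * comb(n_universe - n_term, n_query - i)
        for i in range(k, upper + 1)
    )
    return tail / total


# Hypothetical term: 200 annotated genes in the universe, 25 in the query.
print(expected_count(1607, 200, 29139))
print(fisher_enrichment_pvalue(25, 1607, 200, 29139))
```

Under the classic algorithm, this calculation would simply be repeated for every GO term, ignoring the relationships between terms.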

> The p-values have not been corrected for multiple testing: this is an exploratory analysis, and in many cases adjusted p-values may be misleading.


Semantic Similarity Analysis

Semantic similarity (SS) measures how similar two GO terms are based on their meaning. SS is therefore used to group enriched GO terms according to their annotation and their topological position in the GO graph.

I am calculating the “distance” between two terms using the Wang method, a graph-based method which maintains the topology of the GO graph throughout the analysis.
There are also other types of methods based solely on Information Content…
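As an illustration of how a graph-based measure of this kind works, the sketch below computes Wang similarity on a tiny, hypothetical GO-like graph. Each term passes a decaying contribution up to its ancestors, and two terms are similar to the extent that they share strongly weighted ancestors. The edge weights (0.8 for is_a, 0.6 for part_of) are the ones commonly used with this method; the graph, term names and function names are invented for the example and are not taken from the analysis.

```python
# Edge weights commonly used with the Wang method.
WEIGHTS = {"is_a": 0.8, "part_of": 0.6}

# Hypothetical toy graph: child -> list of (parent, relation).
TOY_PARENTS = {
    "root": [],
    "P": [("root", "is_a")],
    "A": [("P", "is_a")],
    "B": [("P", "is_a")],
    "C": [("P", "part_of")],
}


def s_values(term, parents, weights=WEIGHTS):
    """S-value of every ancestor of `term` (the term itself scores 1.0)."""
    s = {term: 1.0}
    stack = [term]
    while stack:
        child = stack.pop()
        for parent, relation in parents.get(child, []):
            contribution = weights[relation] * s[child]
            # Keep the best path; revisit the parent if its value improved.
            if contribution > s.get(parent, 0.0):
                s[parent] = contribution
                stack.append(parent)
    return s


def wang_similarity(term_a, term_b, parents, weights=WEIGHTS):
    """Shared S-values divided by the total S-values of both terms."""
    s_a = s_values(term_a, parents, weights)
    s_b = s_values(term_b, parents, weights)
    shared = s_a.keys() & s_b.keys()
    numerator = sum(s_a[t] + s_b[t] for t in shared)
    return numerator / (sum(s_a.values()) + sum(s_b.values()))


print(wang_similarity("A", "B", TOY_PARENTS))
print(wang_similarity("A", "C", TOY_PARENTS))
```

A distance can then be derived from the similarity, for example as one minus the similarity; whether the package uses exactly that conversion is an assumption here.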

## 'select()' returned 1:1 mapping between keys and columns

Clustering of enriched GO Terms

A further step is to cluster the enriched GO terms. Grouping related terms makes the results easier to interpret.

Distances between sets of GO terms are calculated using the Best-Match Average (BMA) approach. For each term in one set, BMA takes its maximum similarity to any term in the other set and averages these maxima; the result is then averaged with the same quantity computed in the reverse direction to obtain a symmetric similarity.
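The BMA idea can be sketched in a few lines. The version below averages the two directional best-match means, which is one common formulation; the exact formula used inside the package may differ. The term names, similarity values and function names are hypothetical.

```python
# Hypothetical pairwise similarities between GO terms.
TOY_SIM = {
    ("t1", "t3"): 0.9,
    ("t2", "t3"): 0.3,
}


def toy_similarity(a, b):
    """Symmetric lookup in TOY_SIM; identical terms score 1.0."""
    if a == b:
        return 1.0
    return TOY_SIM.get((a, b), TOY_SIM.get((b, a), 0.0))


def best_match_average(set_a, set_b, sim):
    """Mean of the two directional averages of best matches."""
    # For each term in A, its best match in B, averaged over A.
    a_to_b = sum(max(sim(a, b) for b in set_b) for a in set_a) / len(set_a)
    # The same in the other direction, so the result is symmetric.
    b_to_a = sum(max(sim(a, b) for a in set_a) for b in set_b) / len(set_b)
    return (a_to_b + b_to_a) / 2


print(best_match_average(["t1", "t2"], ["t3"], toy_similarity))
```

Averaging both directions matters because the best matches from A to B are generally not the same as those from B to A, especially when the two sets differ in size.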

## Warning: `line.width` does not currently support multiple values.


With this clustering approach, a total of 26 clusters are identified.
Each cluster is named after the first common ancestor GO term of its members.
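One plausible reading of "first common ancestor" is the most specific term that is an ancestor of every term in the cluster. The sketch below implements that reading on a hypothetical toy graph; the graph, term names and function names are invented for illustration, and the package may resolve ties or choose the label differently.

```python
# Hypothetical toy graph: child -> list of parent terms.
TOY_PARENTS = {
    "root": [],
    "P": ["root"],
    "Q": ["root"],
    "A": ["P"],
    "B": ["P"],
    "C": ["Q"],
}


def ancestors(term, parents):
    """All ancestors of `term`, including the term itself."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def first_common_ancestors(terms, parents):
    """Most specific terms shared by the ancestor sets of all `terms`."""
    common = set.intersection(*(ancestors(t, parents) for t in terms))
    # Drop any common ancestor that sits above another common ancestor.
    return {
        c
        for c in common
        if not any(c != other and c in ancestors(other, parents) for other in common)
    }


print(first_common_ancestors(["A", "B"], TOY_PARENTS))
print(first_common_ancestors(["A", "C"], TOY_PARENTS))
```

The function returns a set because, in a graph where terms can have several parents, more than one equally specific common ancestor may exist.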